
feat(sink): support bigquery sink upsert #15780

Merged: 16 commits merged into main from xxh/support-bigquery-cdc on Apr 26, 2024

Conversation

@xxhZs (Contributor) commented Mar 19, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

#14882
Refactor the BigQuery sink to use the Storage Write API (https://cloud.google.com/bigquery/docs/write-api) and support upsert in the BigQuery sink.
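
For context, the Storage Write API appends protobuf-serialized rows together with a writer schema that describes them. Below is a minimal sketch of the request shape, assuming the google-cloud-googleapis message bindings; the module paths and the Default plumbing are assumptions, not the exact code in this PR.

use google_cloud_googleapis::cloud::bigquery::storage::v1::{
    append_rows_request::{ProtoData, Rows},
    AppendRowsRequest, ProtoRows, ProtoSchema,
};
use prost_types::DescriptorProto;

// Build one AppendRows request: the row schema travels as a proto descriptor,
// and each row is pre-serialized as a protobuf message of that schema.
fn build_append_request(
    write_stream: String,
    descriptor_proto: DescriptorProto,
    serialized_rows: Vec<Vec<u8>>, // one encoded message per row
) -> AppendRowsRequest {
    AppendRowsRequest {
        write_stream,
        rows: Some(Rows::ProtoRows(ProtoData {
            writer_schema: Some(ProtoSchema {
                proto_descriptor: Some(descriptor_proto),
            }),
            rows: Some(ProtoRows { serialized_rows }),
        })),
        ..Default::default()
    }
}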

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

The BigQuery sink now supports upsert. Users need to set the corresponding permissions and primary key according to the BigQuery change data capture documentation:
https://cloud.google.com/bigquery/docs/change-data-capture?hl=zh-cn

@xxhZs xxhZs requested a review from a team as a code owner March 19, 2024 08:18
@xxhZs xxhZs changed the title feat(sink): support bigquery upsert feat(sink): support bigquery sink upsert Mar 19, 2024
@xxhZs xxhZs force-pushed the xxh/support-bigquery-cdc branch 2 times, most recently from 5cc2579 to 2b2402e Compare March 19, 2024 09:49
@xxhZs xxhZs requested review from hzxa21 and wenym1 March 19, 2024 09:49
@xxhZs xxhZs added the user-facing-changes Contains changes that are visible to users label Mar 19, 2024
@@ -364,18 +409,61 @@ fn encode_field<D: MaybeData>(
Ok(Value::Message(message.transcode_to_dynamic()))
})?
}
(false, Kind::String) if is_big_query => {

Contributor:

It seems that the newly added custom_proto_type is only used to generate the is_big_query flag, and the flag is only used to control the logic when seeing different datatypes. If so, instead of adding this new parameter, we can just add new methods like on_timestamptz, on_jsonb ... to the MaybeData trait and then call the corresponding trait methods here. And then for bigquery we can implement its own MaybeData with new customized logic and then pass it to the encoder.

Contributor (author):

There's a match between RW types and proto types; BigQuery has some special matches (many of which convert to string) that require is_big_query to determine whether the match holds.

Contributor:

The newly added logic does not look specific to BigQuery. It's more like an extension of the original type compatibility and can be generalized for proto encoding used in sinks other than BigQuery (cc @xiangjinwu).

If so, we can remove the is_big_query flag and custom_proto_type and make it a general way of processing.

Contributor:

I agree it can be more general. It is better to introduce a TimestamptzHandlingMode and bigquery can just select one of the string formats, similar to the json encoder.

However it may not be another implementation of MaybeData (for bigquery). That trait is only meant to be implemented twice: once with type info alone (for validation), and once with concrete datum (for encoding). Given that we want to affect both validation and encoding here, it is supposed to be in encode_field here.
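
To make this concrete, here is a minimal sketch of such an option, mirroring the JSON encoder's TimestamptzHandlingMode; the variant names and the chrono-based formatting are illustrative assumptions, not the final API.

use chrono::{DateTime, Utc};

/// How a timestamptz datum is rendered by the proto encoder.
#[derive(Clone, Copy)]
enum TimestamptzHandlingMode {
    /// keep the proto int64 of microseconds since the UNIX epoch
    Micro,
    /// render an RFC 3339 UTC string, the form BigQuery string columns accept
    UtcString,
}

enum EncodedTimestamptz {
    I64(i64),
    String(String),
}

/// What encode_field could emit for a timestamptz datum under each mode.
fn encode_timestamptz(epoch_micros: i64, mode: TimestamptzHandlingMode) -> EncodedTimestamptz {
    match mode {
        TimestamptzHandlingMode::Micro => EncodedTimestamptz::I64(epoch_micros),
        TimestamptzHandlingMode::UtcString => EncodedTimestamptz::String(
            DateTime::<Utc>::from_timestamp_micros(epoch_micros)
                .expect("timestamp in range")
                .to_rfc3339(),
        ),
    }
}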

src/connector/src/sink/encoder/proto.rs (outdated; resolved)

Commits: "remove index", "fix", "fix"
@xxhZs xxhZs requested a review from xiangjinwu April 2, 2024 04:08
@xxhZs xxhZs requested a review from wenym1 April 9, 2024 05:06
* Group C: experimental */
},
DataType::Int16 => match (expect_list, proto_field.kind()) {
(false, Kind::Int64) => maybe.on_base(|s| Ok(Value::I64(s.into_int16() as i64)))?,

Contributor:

May also add support for casting to Int32 and Int16 by the way.
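
As a standalone sketch of that suggestion: protobuf has no 16-bit scalar, so an RW Int16 can reasonably target proto int32 in addition to int64. The on_base plumbing from the surrounding code is elided here; the prost-reflect Value variants are shown directly.

use prost_reflect::Value;

// Widen an int16 datum to whichever integer width the proto field declares.
fn encode_int16(v: i16, target_is_int32: bool) -> Value {
    if target_is_int32 {
        Value::I32(v as i32) // proposed: proto int32 target
    } else {
        Value::I64(v as i64) // existing arm: proto int64 target
    }
}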

Cargo.lock (resolved)
src/connector/with_options_sink.yaml (outdated; resolved)
message_descriptor,
writer_pb_schema: ProtoSchema {
proto_descriptor: Some(descriptor_proto),
},
})
}

async fn append_only(&mut self, chunk: StreamChunk) -> Result<()> {

Contributor:

Do we still need this special append_only method? It is only a subset of upsert.

Contributor (author):

One difference is that append-only does not add the CHANGE_TYPE column
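
For reference, BigQuery CDC over the Storage Write API drives upserts through a _CHANGE_TYPE pseudo-column whose values are UPSERT and DELETE. A sketch of how stream-chunk ops could map onto it; the mapping below is inferred from the BigQuery CDC docs, not copied from this PR.

use risingwave_common::array::Op;

// In upsert mode every row carries a change type; in append-only mode the
// column is omitted entirely, which is the difference noted above.
fn change_type(op: Op) -> Option<&'static str> {
    match op {
        Op::Insert | Op::UpdateInsert => Some("UPSERT"),
        Op::Delete => Some("DELETE"),
        // The old image of an update can be skipped: the UPSERT keyed on the
        // primary key already replaces the previous row.
        Op::UpdateDelete => None,
    }
}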

Resolved review threads on src/connector/src/sink/big_query.rs
Commit: "fix"
}

async fn get_auth_json_from_path(&self, aws_auth_props: &AwsAuthProps) -> Result<String> {
if let Some(local_path) = &self.local_path {

Contributor:

Maybe not related to this PR: can we avoid using separate local_path and s3_path options and instead use a single path option? The path type could be distinguished by its prefix, such as file:// or s3://.

Besides, BigQuery is on Google Cloud Platform, yet users are expected to upload the auth file to S3, which looks strange to me. Does GCP have an S3-compatible object store?
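
A sketch of that suggestion, with hypothetical names (the PR as written keeps two separate fields):

/// Where the service-account JSON lives, derived from one `path` option.
enum AuthSource {
    Local(String),
    S3(String),
}

fn parse_auth_path(path: &str) -> Result<AuthSource, String> {
    if let Some(local) = path.strip_prefix("file://") {
        Ok(AuthSource::Local(local.to_owned()))
    } else if path.starts_with("s3://") {
        // keep the full URI; the S3 loader parses bucket/key itself
        Ok(AuthSource::S3(path.to_owned()))
    } else {
        Err(format!("unsupported auth path scheme: {path}"))
    }
}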

@docteurklein commented Apr 10, 2024:


Agreed, references to S3 or AWS in general always confused me. Is there a GCP alternative?

Contributor (author):

Unfortunately, this file is necessary to connect to GCP, so it can't be stored on GCP itself. An alternative is to pass the JSON content of the file directly as a string.

Contributor:

We may provide an extra option for users to include the raw auth string in the sink options.

This can be done in a separate PR, since the current PR is already large and this feature is independent of it.
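
A sketch of what that extra option could look like; the field and type names are hypothetical.

/// Auth options with an inline-JSON alternative alongside the file path.
struct BigQueryAuth {
    /// raw service-account JSON pasted directly into the sink options
    credentials_json: Option<String>,
    /// path to the JSON file (local or S3), as in the current PR
    auth_file_path: Option<String>,
}

enum AuthJson {
    Inline(String),
    FromPath(String),
}

impl BigQueryAuth {
    /// Prefer the inline JSON when both are given.
    fn resolve(&self) -> Option<AuthJson> {
        match (&self.credentials_json, &self.auth_file_path) {
            (Some(raw), _) => Some(AuthJson::Inline(raw.clone())),
            (None, Some(path)) => Some(AuthJson::FromPath(path.clone())),
            (None, None) => None,
        }
    }
}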

Resolved review threads on src/connector/src/sink/big_query.rs (outdated)
@wenym1 (Contributor) left a comment:

Rest LGTM.

@xxhZs xxhZs enabled auto-merge April 11, 2024 07:17
@xxhZs xxhZs disabled auto-merge April 11, 2024 07:17
checksum = "c48abc8687f4c4cc143dd5bd3da5f1d7ef38334e4af5cef6de4c39295c6a3fd0"
dependencies = [
"anyhow",
"arrow 50.0.0",

Collaborator:

We introduce the full arrow 50 dependency tree here. We should verify whether it greatly affects the compile time of DEBUG and RELEASE builds. If it does, we may need to consider how to avoid this dependency, given that we technically don't use arrow in google-cloud-bigquery.

Contributor (author) @xxhZs commented Apr 12, 2024:

Tested on my machine:

profile  branch      time     target size  build num
release  main        27m 24s  27173092317  1471
release  current pr  28m 14s  27275146355  1476
debug    main        4m 31s   48776880726  1413
debug    current pr  4m 51s   48920738495  1418

Collaborator:

The result looks good. Thanks for the efforts.

@xxhZs xxhZs requested a review from hzxa21 April 19, 2024 03:05

@hzxa21 (Collaborator) left a comment:

LGTM.

@xxhZs xxhZs enabled auto-merge April 25, 2024 05:51
@xxhZs xxhZs added this pull request to the merge queue Apr 25, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 25, 2024
@xxhZs xxhZs enabled auto-merge April 25, 2024 06:47
@xxhZs xxhZs disabled auto-merge April 25, 2024 07:06
@xxhZs xxhZs enabled auto-merge April 26, 2024 04:15
@xxhZs xxhZs added this pull request to the merge queue Apr 26, 2024
Merged via the queue into main with commit 2dcd0f6 Apr 26, 2024
27 of 28 checks passed
@xxhZs xxhZs deleted the xxh/support-bigquery-cdc branch April 26, 2024 06:07
Comment on lines 23 to +24
use gcp_bigquery_client::Client;
use google_cloud_bigquery::grpc::apiv1::bigquery_client::StreamingWriteClient;

Member:

Why are we using two BigQuery clients? Can we use only one? (The former, gcp_bigquery_client, is quite old now.)

Labels: type/feature, user-facing-changes (Contains changes that are visible to users)

6 participants